Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics
Authors
Abstract
Existing word representation methods mostly limit their information source to word co-occurrence statistics. In this paper, we introduce ngrams into four representation methods: SGNS, GloVe, the PPMI matrix, and its SVD factorization. Comprehensive experiments are conducted on word analogy and similarity tasks. The results show that improved word representations are learned from ngram co-occurrence statistics. We also demonstrate that the trained ngram representations are useful in many aspects, such as finding antonyms and collocations. Besides, a novel approach to building the co-occurrence matrix is proposed to alleviate the hardware burden brought by ngrams.
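To make the idea concrete, the sketch below counts ngram-ngram co-occurrences and weights them with PPMI. It illustrates the general counts-to-PPMI recipe rather than the authors' released ngram2vec pipeline; the helper names, toy corpus, window size, and ngram order are assumptions made for this example.

```python
# Minimal sketch: count ngram-ngram co-occurrences, then weight them with PPMI.
# Illustrative only; not the authors' released ngram2vec code.
from collections import Counter
from math import log

def extract_ngrams(tokens, max_n=2):
    """Return (start_position, ngram) pairs for all 1..max_n grams of a sentence."""
    return [(i, tuple(tokens[i:i + n]))
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def cooccurrence_counts(sentences, window=2, max_n=2):
    """Count center/context ngram pairs whose start positions lie within `window` words."""
    pairs = Counter()
    for tokens in sentences:
        grams = extract_ngrams(tokens, max_n)
        for i, center in grams:
            for j, context in grams:
                if (i, center) != (j, context) and abs(i - j) <= window:
                    pairs[(center, context)] += 1
    return pairs

def ppmi(pairs):
    """Turn raw pair counts into a sparse PPMI matrix stored as a dict of dicts."""
    total = sum(pairs.values())
    center_sum, context_sum = Counter(), Counter()
    for (c, x), n in pairs.items():
        center_sum[c] += n
        context_sum[x] += n
    matrix = {}
    for (c, x), n in pairs.items():
        value = log(n * total / (center_sum[c] * context_sum[x]))
        if value > 0:  # keep positive PMI only
            matrix.setdefault(c, {})[x] = value
    return matrix

corpus = [["ngram", "statistics", "improve", "word", "representations"],
          ["word", "representations", "from", "ngram", "co-occurrence", "statistics"]]
rows = ppmi(cooccurrence_counts(corpus, window=2, max_n=2))
print(rows.get(("word", "representations"), {}))
```

Rows of such a sparse matrix can then be factorized (for example with SVD) to obtain dense ngram representations, mirroring the PPMI and SVD branches mentioned in the abstract.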
Similar references
TrWP: Text Relatedness using Word and Phrase Relatedness
Text is composed of words and phrases. In the bag-of-words model, phrases in texts are split into words, which may discard the inner semantics of phrases and in turn give inconsistent relatedness scores between two texts. TrWP, an unsupervised text relatedness approach, combines both word and phrase relatedness. The word relatedness is computed using an existing unsupervised co-occurrence based...
Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD.
In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations acros...
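The recipe summarized here, PMI-weighted vectors from narrow co-occurrence windows compared with a cosine measure, reduces to a few lines; the toy vectors below are invented for illustration and stand in for rows of such a PMI matrix.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two PMI-weighted co-occurrence row vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Toy PMI rows over a shared context vocabulary (values invented for the example).
row_cat = np.array([1.2, 0.0, 0.8, 0.3])
row_dog = np.array([0.9, 0.1, 0.7, 0.0])
print(cosine_similarity(row_cat, row_dog))
```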
Extracting semantic representations from word co-occurrence statistics: a computational study.
The idea that at least some aspects of word meaning can be induced from patterns of word co-occurrence is becoming increasingly popular. However, there is less agreement about the precise computations involved, and the appropriate tests to distinguish between the various possibilities. It is important that the effects of the relevant design choices and parameter values are understood if psycholo...
Learning Lexical Properties from Word Usage Patterns: Which Context Words Should be Used?
Several recent papers have described how lexical properties of words can be captured by simple measurements of which other words tend to occur close to them. At a practical level, word co-occurrence statistics are used to generate high dimensional vector space representations and appropriate distance metrics are defined on those spaces. The resulting co-occurrence vectors have been used to acco...
The role of word-word co-occurrence in word learning
A growing body of research on early word learning suggests that learners gather word-object co-occurrence statistics across learning situations. Here we test a new mechanism whereby learners are also sensitive to word-word co-occurrence statistics. Indeed, we find that participants can infer the likely referent of a novel word based on its co-occurrence with other words, in a way that mimics a ...